Audit Framework · AI Agent Infrastructure
AI Agent Infrastructure Audit
A complete audit framework for AI agent systems — covering infrastructure, autonomous actions, audit trails, data governance, EU AI Act compliance, access controls, and incident response.
AUG 2026 · High-Risk Deadline
D1
AI AGENT INFRASTRUCTURE & SYSTEMS AUDIT
Systems
What makes this different from a standard IT audit: You are not just auditing servers — you are auditing an autonomous software stack that makes decisions and invokes real-world actions. Every component in the chain is a potential attack or failure surface.
AI agent stack — what the auditor must map
[U] User / Trigger: human input or automated event
[O] Orchestrator: LangChain / AutoGen / CrewAI
[L] LLM API: GPT-4 / Claude / Gemini
[T] Tool Layer: APIs, DBs, file system, web
[M] Memory: vector DB / context store
[R] Response: action taken or answer returned
Infrastructure audit — test procedure per component
Test 1
Orchestration layer review
- Framework version pinned?
- Dependency SCA scan run?
- Config stored in version control?
- No hardcoded prompts in source?
Test 2
LLM API connection security
- API keys stored in secrets manager?
- Model version pinned (no auto-upgrade)?
- Rate limits and spend caps enforced?
- DPA signed with LLM provider?
Test 3
Tool integration inventory
- Full list of tools agent can invoke?
- Each tool documented with purpose?
- Unused tools disabled?
- Any tool with write/delete access flagged?
Test 4
Memory & vector DB controls
- Access control on vector DB?
- PII not stored in context store?
- Retention policy on memory defined?
- Data poisoning prevention in place?
Test 5
Network segmentation
- Agent process isolated in own network segment?
- Egress filtering on outbound tool calls?
- Agent cannot reach internal admin systems?
Test 6
Supply chain & dependencies
- All packages locked in requirements file?
- SBOM (software bill of materials) exists?
- Known CVEs scanned and patched?
- No unpinned pip install in CI/CD?
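The "no unpinned install" check can be automated. A minimal sketch that flags requirements lines without an exact `==` pin (ranges such as `>=` and `~=` are treated as unpinned and flagged for review):

```python
import re

def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()   # drop comments and whitespace
        if not line:
            continue
        # An exact pin looks like 'package==1.2.3'; anything else is flagged.
        if not re.search(r"==\s*[\w.\-]+", line):
            unpinned.append(line)
    return unpinned

reqs = """
langchain==0.2.5
requests>=2.0     # range, not pinned
openai
"""
print(find_unpinned(reqs))   # → ['requests>=2.0', 'openai']
```

Run this against every requirements file in the repo and in CI images; an empty result is the evidence the checklist item asks for.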
Outputs →
Agent Infrastructure Map
Systems Inventory
Integration Risk Register
Supply Chain Assessment
D2
AI ACTIONS & TOOL USE AUDIT
AI Behaviour
The most novel audit domain. Traditional IT audits test whether humans followed procedures. Here you are testing whether an autonomous agent acted within permitted boundaries — without a human approving every step.
Tool permission matrix — what agents are allowed to invoke
| Tool / Action         | Read | Write | Delete | External Call | Financial |
|-----------------------|------|-------|--------|---------------|-----------|
| Database query        | ✓    | HITL  | ✗      | ✗             | ✗         |
| Send email / message  | ✓    | HITL  | ✗      | ✓             | ✗         |
| Payment / transaction | ✗    | ✗     | ✗      | ✗             | HITL      |
Key: ✓ Permitted autonomously | ✗ Blocked by policy | HITL = Human-in-the-loop approval required before execution
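A matrix like this should be enforced in code, not only stated in policy. A minimal sketch in Python with default-deny for any tool or action not explicitly listed (tool and action names are illustrative):

```python
# Permission matrix encoded as data the orchestrator consults
# before every tool call. Anything unlisted is denied.
ALLOW, DENY, HITL = "allow", "deny", "hitl"

PERMISSIONS = {
    "database_query": {"read": ALLOW, "write": HITL, "delete": DENY,
                       "external": DENY, "financial": DENY},
    "send_email":     {"read": ALLOW, "write": HITL, "delete": DENY,
                       "external": ALLOW, "financial": DENY},
    "payment":        {"read": DENY, "write": DENY, "delete": DENY,
                       "external": DENY, "financial": HITL},
}

def check_permission(tool: str, action: str) -> str:
    # Default-deny: unknown tools or actions are blocked outright.
    return PERMISSIONS.get(tool, {}).get(action, DENY)
```

The audit test is then mechanical: attempt each blocked cell and confirm the orchestrator refuses before the tool is ever invoked.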
AI actions audit — test procedure
Test 1
Tool boundary enforcement
- Are tool permissions enforced in code, not just policy?
- Can agent bypass permission by chaining tools?
- Test: attempt unauthorised action — is it blocked?
Test 2
Human-in-the-loop (HITL) checkpoints
- All destructive actions require HITL?
- HITL cannot be bypassed by prompt instruction?
- Timeout on pending HITL — agent stops, not proceeds?
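A HITL checkpoint should fail closed: if no human decides within the timeout, the agent stops rather than proceeds. A minimal sketch using a queue as the approval channel (the channel and timeout are illustrative):

```python
import queue

def hitl_gate(approvals: "queue.Queue[bool]", timeout_s: float = 300.0) -> bool:
    """Block until a human approves or rejects the pending action.

    On timeout the gate returns False (deny) — the agent must never
    treat silence as approval.
    """
    try:
        return approvals.get(timeout=timeout_s)
    except queue.Empty:
        return False
```

The audit test is the timeout path: trigger a destructive action, let the approval window lapse, and confirm the action was abandoned, not executed.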
Test 3
Prompt injection resistance
- Agent tested against indirect prompt injection?
- Web-retrieved content cannot override system prompt?
- Input sanitisation on all external content?
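One cheap check the auditor can run: scan retrieved content for instruction-like phrases before it enters the context window. The phrase list below is illustrative only — a heuristic like this is no substitute for structural defences such as separating instructions from data:

```python
import re

# Naive marker phrases for indirect prompt injection (illustrative).
INJECTION_MARKERS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
]

def flag_retrieved_content(text: str) -> list[str]:
    """Return the marker patterns found in a retrieved document,
    so it can be quarantined before reaching the agent's context."""
    lowered = text.lower()
    return [p for p in INJECTION_MARKERS if re.search(p, lowered)]
```

A non-empty result should block or quarantine the document; an empty result proves nothing, which is why the checklist also demands adversarial testing.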
Test 4
Multi-agent trust chains
- Sub-agents cannot exceed orchestrator permissions?
- Inter-agent communication authenticated?
- Rogue sub-agent cannot elevate privilege?
Test 5
Rate limiting & scope caps
- Max tool calls per session defined?
- Max tokens / cost per run capped?
- Runaway loop detection in place?
- Spend cap enforced at API gateway level?
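Scope caps are easiest to audit when they are enforced in one place. A minimal per-session budget sketch — the limits here are illustrative, not recommendations:

```python
class SessionBudget:
    """Hard caps per agent session; exceeding any cap aborts the run."""

    def __init__(self, max_tool_calls: int = 25, max_tokens: int = 50_000):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.tool_calls = 0
        self.tokens = 0

    def charge(self, tokens: int = 0, tool_call: bool = False) -> None:
        """Record usage; raise as soon as any cap is breached."""
        self.tool_calls += int(tool_call)
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls or self.tokens > self.max_tokens:
            raise RuntimeError("session budget exceeded — aborting agent run")
```

An in-process cap like this should back up, not replace, the spend cap at the API gateway, so a compromised agent process cannot simply skip the check.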
Test 6
Output sanitisation
- Agent output filtered before downstream use?
- PII stripped from outputs?
- Code execution output sandboxed?
Outputs →
Tool Permission Matrix
Action Risk Register
HITL Gap Analysis
Injection Test Results
D3
AUDIT TRAIL & OBSERVABILITY
Traceability
The EU AI Act requires automatic logging of high-risk AI operations. Without a complete, tamper-evident audit trail, you cannot investigate incidents, prove compliance, or explain decisions to regulators.
Complete log chain — every hop must be captured
[1] Session Start: user ID, timestamp, trigger source
[2] Prompt Log: system prompt + user input hash
[3] LLM Decision: model, version, token count, response
[4] Tool Invocation: tool name, params, result, latency
[5] HITL Event: approver ID, decision, timestamp
[6] Final Output: action taken or response returned
Audit trail — test procedure
Test 1
Log completeness
- Every tool call has a matching log entry?
- No gaps between session start and end?
- Replay 5 sessions — do logs reconstruct fully?
Test 2
Tamper evidence
- Logs written to write-once / append-only store?
- Hash chain or WORM storage enforced?
- No admin can delete logs without dual approval?
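One way to make a log tamper-evident is a hash chain: each entry's hash covers the previous entry's hash, so any later edit or deletion breaks verification. A minimal in-memory sketch — a real deployment would write to an append-only or WORM store, not a Python list:

```python
import hashlib
import json

def append_entry(log: list[dict], event: dict) -> None:
    """Append an entry whose hash chains to the previous entry."""
    prev_hash = log[-1]["hash"] if log else "genesis"
    payload = json.dumps(event, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"event": event, "prev": prev_hash, "hash": entry_hash})

def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edit or deletion surfaces as a mismatch."""
    prev = "genesis"
    for entry in log:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

The audit test is direct: alter one historical entry and confirm verification fails from that point onward.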
Test 3
Retention compliance
- Retention period defined per data type?
- High-risk AI logs retained minimum required period?
- Automated deletion policy enforced at expiry?
Test 4
Searchability & incident response
- Logs indexed by session ID, user, tool, time?
- Can reconstruct full decision chain in <30 min?
- Tested during tabletop incident exercise?
Test 5
PII handling in logs
- PII in prompts masked / pseudonymised?
- Log access restricted to authorised personnel?
- Raw prompt content not stored in plain text?
Outputs →
Observability Gap Report
Log Architecture Diagram
Retention Compliance Check
Incident Replay Test Result
D4
DATA GOVERNANCE & PRIVACY
GDPR · Data
AI agents process personal data differently from traditional systems. Context windows, RAG pipelines, and memory stores create new data flows that standard GDPR assessments miss entirely.
AI-specific data flows — each requires a lawful basis and DPA
Data entering the agent
- User input (may contain PII)
- RAG retrieval from knowledge base
- Tool responses (CRM, DB, email data)
- Memory recall from prior sessions
- System prompt (may embed user profile)
Data processed by LLM
- Full context window sent to external LLM
- Fine-tuning datasets (if applicable)
- Embedding generation for vector storage
- Intermediate reasoning chains
- Every token sent = a data transfer
Data exiting the agent
- Agent outputs stored in logs
- Actions written to downstream systems
- Memory persisted to vector DB
- Reports or emails sent externally
- Embeddings retained indefinitely?
Test 1
LLM provider DPAs
- Signed DPA with OpenAI / Anthropic / Google?
- Data processing terms reviewed by legal?
- Provider certified for EU data processing?
Test 2
PII detection before LLM
- PII scanner on all inputs before API call?
- Names, emails, IDs masked or tokenised?
- Medical / financial data never sent to LLM?
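A pre-call PII gate can sit between the orchestrator and the LLM API. The two regexes below are illustrative only — a production system should use a dedicated PII detection library, not a pair of patterns:

```python
import re

# Illustrative patterns; real deployments need far broader coverage
# (names, national IDs, account numbers, medical terms, ...).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace detected PII with a typed placeholder before the API call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact jane.doe@example.com"))   # → Contact [EMAIL]
```

The audit test: feed known PII through the full pipeline and confirm only the masked form appears in the outbound API request and in the logs.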
Test 3
RAG data governance
- RAG source documents classified by sensitivity?
- Access control on vector DB by user role?
- Deleted docs removed from embeddings too?
Test 4
Right to erasure (GDPR Art. 17)
- Can a user's data be fully deleted from logs?
- Embeddings containing PII can be removed?
- Test: submit erasure request — verify end-to-end
Test 5
Training data provenance
- Fine-tuning data has documented lawful basis?
- No customer data used in training without consent?
- Data lineage documented end-to-end?
Outputs →
AI Data Flow Map
GDPR Gap Report
Vendor DPA Register
Erasure Test Evidence
D5
EU AI ACT COMPLIANCE ASSESSMENT
Regulatory
First task: classify every AI agent by risk tier. The tier determines everything — obligations, deadlines, and whether third-party conformity assessment is required. Misclassification is itself a compliance failure.
Step 1 — classify each AI agent by risk tier
Banned
Prohibited Feb 2025
- Biometric surveillance in public
- Social scoring
- Subliminal manipulation
- Exploitation of vulnerable groups
Audit action
- Confirm not deployed
- Document classification decision
High Risk
Deadline: Aug 2026
- CV screening agents
- Credit risk agents
- Medical decision agents
- Fraud detection (financial)
- Critical infrastructure ops
Full compliance suite
- All 8 obligations apply
- Conformity assessment required
- EU AI database registration
Limited
Deadline: Aug 2026
- Customer service chatbots
- AI writing assistants
- Content generation agents
Transparency only
- Disclose AI interaction
- Label AI-generated content
Minimal
No mandatory obligations
- Internal knowledge agents
- Code assistants (internal)
- Summarisation tools
Voluntary best practice
- Document classification
- Apply voluntary code
Step 2 — test all 8 EU AI Act obligations for high-risk agents
1. Risk management system
Documented, iterative risk process covering full AI lifecycle
Risk register exists? Updated at each model change? Reviewed by accountable owner?
2. Data governance
Training data documented, bias-examined, quality-checked
Data lineage documented? Bias metrics computed and within thresholds? Signed off?
3. Technical documentation
Complete technical file covering capabilities, limitations, architecture
Technical file current? Version-controlled? Accessible to regulators within 72hr?
4. Automatic logging
AI system generates automatic logs of operations for traceability
Logs auto-generated? Capture all decisions? Retained for required period? Tamper-evident?
5. Transparency to users
Users informed they are interacting with an AI system
Disclosure present before first interaction? Clear and prominent? Not buried in T&Cs?
6. Human oversight
Human operators able to monitor, intervene, stop the system
Kill switch tested? Override mechanism documented? Oversight role assigned and trained?
7. Accuracy & robustness
Consistent performance, resilience to errors and adversarial inputs
Accuracy metrics documented? Adversarial testing done? Edge case behaviour defined?
8. Conformity assessment
Self-assessment or third-party audit + EU AI database registration
Assessment completed? Registered in EU AI database? CE marking applied if required?
Step 3 — establish provider vs deployer obligations split
PROVIDER — built the AI system
Obligations of the system builder
- Technical documentation and data governance
- Conformity assessment before market placement
- EU AI database registration
- CE marking on high-risk systems
- Post-market monitoring and incident reporting
DEPLOYER — uses the AI system in a specific context
Obligations of the user organisation
- Human oversight measures implemented
- Input data quality and relevance maintained
- Users informed of AI interaction
- Fundamental rights impact assessment (FRIA)
- Cannot deploy in ways that exceed intended purpose
Outputs →
Risk Tier Classification
EU AI Act Gap Analysis
Conformity File Readiness
Provider / Deployer Split
D6
ACCESS, IDENTITY & PRIVILEGE CONTROLS
Access
AI agents are identities. A service account that runs an AI agent must be treated with the same rigour as a privileged human user — potentially more, since it can act autonomously at machine speed.
Test 1
Agent service account review
- Every agent has a dedicated service account?
- Least privilege enforced on each account?
- No shared credentials across multiple agents?
- No human account used as agent identity?
Test 2
Secrets & API key management
- All API keys stored in vault (not env vars)?
- Key rotation enforced — max 90 day lifetime?
- No secrets in source code or config files?
- Secret scanning in CI/CD pipeline?
Test 3
Deployment pipeline access
- Who can deploy or modify agent config?
- MFA enforced on deployment pipeline?
- Change approval required before prod deployment?
Test 4
Agent-to-agent authentication
- Sub-agents must authenticate to orchestrator?
- No implicit trust between agent processes?
- Token-based auth with short expiry?
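Short-expiry token auth between agents can be sketched with stdlib HMAC. The shared secret and TTL here are illustrative — a real deployment would issue per-agent secrets from a vault rather than hard-coding one:

```python
import hashlib
import hmac
import time

SECRET = b"shared-orchestrator-secret"   # illustrative; fetch from a vault

def issue_token(agent_id: str, ttl_s: int = 60, now=None) -> str:
    """Mint a short-lived token a sub-agent presents to the orchestrator."""
    expiry = int((now if now is not None else time.time()) + ttl_s)
    msg = f"{agent_id}:{expiry}"
    sig = hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()
    return f"{msg}:{sig}"

def verify_token(token: str, now=None) -> bool:
    """Reject tampered or expired tokens; constant-time signature compare."""
    agent_id, expiry, sig = token.rsplit(":", 2)
    msg = f"{agent_id}:{expiry}"
    expected = hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()
    current = now if now is not None else time.time()
    return hmac.compare_digest(sig, expected) and current < int(expiry)
```

The audit tests map directly: an expired token must be rejected, and a token re-signed with a different agent ID must fail verification.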
Test 5
Periodic access review
- Agent service accounts reviewed quarterly?
- Unused agents decommissioned and deprovisioned?
- Access review evidence retained?
Outputs →
Agent Access Matrix
Secrets Management Review
Privilege Escalation Risk Report
D7
INCIDENT RESPONSE & FAILURE MODES
Risk
AI incidents are different from standard IT incidents. Hallucination, prompt injection, runaway loops, and cascading multi-agent failures require dedicated playbooks — and the EU AI Act requires serious incident reporting to regulators.
AI-specific failure modes the auditor must test
Hallucination
Confident incorrect output
- Detection: output validation layer?
- Containment: human review before high-stakes action?
- Is hallucination rate benchmarked and tracked?
Runaway Loop
Agent calls itself recursively
- Max iteration limit enforced in code?
- Token / cost cap as secondary kill?
- Loop detection alert fires within 60s?
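Runaway-loop detection can be as simple as a hard iteration cap plus a count of repeated identical tool calls. A minimal sketch — both thresholds are illustrative:

```python
from collections import Counter

class LoopGuard:
    """Abort when the agent loops: too many iterations overall,
    or the same tool call repeated too many times."""

    def __init__(self, max_iterations: int = 20, max_repeats: int = 3):
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.iterations = 0
        self.seen = Counter()

    def record(self, tool: str, args: str) -> None:
        """Call before each tool invocation; raises to kill the run."""
        self.iterations += 1
        self.seen[(tool, args)] += 1
        if self.iterations > self.max_iterations:
            raise RuntimeError("iteration cap exceeded — aborting agent run")
        if self.seen[(tool, args)] > self.max_repeats:
            raise RuntimeError(f"runaway loop: {tool}({args}) repeated")
```

Pairing this with a token/cost cap gives the "secondary kill" the checklist asks for: either mechanism alone stops the run.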
Prompt Injection
Malicious instruction via input
- Indirect injection via retrieved content?
- System prompt cannot be overridden?
- Adversarial test suite run regularly?
Cascading Failure
Multi-agent chain collapse
- Failure in one agent isolated from others?
- Circuit breaker pattern implemented?
- Partial failure state handled gracefully?
Model Drift
LLM behaviour changes unexpectedly
- Model version pinned — no silent upgrades?
- Regression tests run on model update?
- Rollback procedure tested and documented?
Unauthorised Action
Agent acts outside permitted scope
- Kill switch tested and working?
- Alert on out-of-scope tool invocation?
- Rollback for side effects of actions?
Incident response — test procedure
Test 1
Kill switch & emergency stop
- Emergency stop tested in staging within last 90 days?
- Kill switch accessible to non-technical staff?
- Agent cannot restart itself after kill?
Test 2
AI incident classification
- AI incidents have their own severity taxonomy?
- Hallucination = classified as incident?
- Runaway loop = P1 incident automatically?
Test 3
EU AI Act serious incident reporting
- Serious incident definition per EU AI Act known?
- Regulator notification process documented?
- Can notify regulator within required timeframe?
Test 4
Rollback and remediation
- Model rollback tested — time to previous version?
- Side effects of agent actions can be reversed?
- Post-incident review process documented?
Test 5
Anomaly detection & alerting
- Alerting on unusual token consumption?
- Alert on unexpected tool call patterns?
- On-call rotation covers AI incidents 24/7?
Outputs →
Failure Mode Register
AI Incident Playbook
Kill Switch Test Evidence
Regulatory Notification Procedure
SC
COMPLIANCE READINESS SCORECARD
Executive Summary
Use this scorecard as your audit executive summary. One row per domain — each scored against five compliance dimensions. Present this to the board before the detailed findings report.
D1 — Infrastructure & Systems: PASS · GAP · GAP · PASS · GAP
D2 — AI Actions & Tool Use: GAP · FAIL · FAIL · GAP · FAIL
D3 — Audit Trail & Observability: PASS · PASS · GAP · PASS · GAP
D4 — Data Governance & Privacy: GAP · GAP · FAIL · GAP · FAIL
D5 — EU AI Act Assessment: GAP · FAIL · FAIL · FAIL · FAIL
D6 — Access & Identity: PASS · PASS · GAP · PASS · N/A
D7 — Incidents & Failures: GAP · GAP · FAIL · GAP · FAIL
Scorecard key: PASS = control exists and operating effectively | GAP = partial — improvement required | FAIL = control absent or not operating — immediate action required
Note: The scorecard above shows a representative example of findings for a typical early-stage AI deployment. Replace each cell with actual test results from your fieldwork. D5 and D4 most commonly produce FAIL ratings in first-time AI Act audits — organisations underestimate how new the GDPR and EU AI Act obligations are for AI-specific data flows.